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Abstract 

Variable selection has received widespread attention over the last decade as we routinely encounter 
high-throughput datasets in complex biological and environment research. Most Bayesian vari¬ 
able selection methods are restricted to mixture priors having separate components for charac¬ 
terizing the signal and the noise. However, such priors encounter computational issues in high 
dimensions. This has motivated continuous shrinkage priors, resembling the two-component pri¬ 
ors facilitating computation and interpretability. While such priors are widely used for estimating 
high-dimensional sparse vectors, selecting a subset of variables remains a daunting task. In this 
article, we propose a general approach for variable selection with shrinkage priors. The presence 
of very few tuning parameters makes our method attractive in comparison to adhoc thresholding 
approaches. The applicability of the approach is not limited to continuous shrinkage priors, but can 
be used along with any shrinkage prior. Theoretical properties for near-collinear design matrices 
are investigated and the method is shown to have good performance in a wide range of synthetic 
data examples. 
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1. INTRODUCTION 


Variable selection in high-dimensional models has received substantial interest in recent years [Fan 


& Lv[[20T0| and is a challenging problem for Bayesians. With rapid advances in data acquisition 
and storage techniques, modern scientific investigations in epidemiology, genomics, imaging and 
networks are increasingly producing more variables compared to the sample size. One hope for 
meaningful inferences in such situations is to discover a subset of variables that explains the phys¬ 
ical or biological process generating the data. Exploiting such underlying structures, commonly 
prevalent in the form of sparsity of model parameters, holds the key to meaningful inferences in 
high-dimensional settings. This article revisits the problem of Bayesian variable selection in the 
context of Gaussian linear models ([T]) using shrinkage priors: 


V = X/? + e, e-Ar(0,a%), 


( 1 ) 


where V is an n-dimensional response observed with respect to the n x p covariate matrix X and 
is the p-dimensional coefficient vector. Traditionally, to select the important variables out of 


Xi,, Xp, a two component mixture prior (also referred to as a spike-and-slab prior) [George 


& McCulloch[ [1993| [1997[ [Mitchell & Beauchamp| [1988| is placed on /3. These priors include a 
mass or a spike at zero characterizing the noise and a continuous density (usually centered at zero) 
representing the signal density. Although these priors are highly appealing in allowing separate 
control of the level of sparsity and the size of the signal coefficients, they lead to computational 
hurdles in high-dimensions due to the need to explore a 2^ model space. [Johnson & Rossell [ 2010[ 
recently showed a startling selection inconsistency phenomenon for several commonly used spike- 


and-slab priors based on intrinsic Bayes factors [Berger & Pericchi, 1996 j, fractional Bayes factors 


[ O’Hagan 1995 j, and g-priors [ Liang et al.H 2008 j when the model size p > ^/n. This behavior 
was attributed to the common practice of centering the prior on the signal component at zero (local 
prior), which obliterates the demarcation between the signal and noise in high dimensions and leads 
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to negligible posterior probability being assigned to any given model. [Johnson & Rossell] [ |2010 | 
advocated the use of non-local priors to obtain selection consistency when p = 0{n), where the 
density for the signals decays to zero in a neighborhood of the origin. When p S> n, it is not 
immediately clear whether non-local prior distributions can provide sufficient distinguishability 
between the signals and the noise coefficients. 

Nevertheless, the practical problem of selecting variables has been a major bottleneck even 
with spike-and-slab priors. Although the highest posterior probability model (HPPM) is com¬ 
monly perceived as the best model [ |Clyde[ |1999[ [Clyde & George[ [1999[ , it is not optimal for 
prediction [ [Barbieri & Berger , [2004[ since HPPM is the Bayes estimate only under 0-1 loss func¬ 
tion. Moreover, finding HPPM in high-dimensions is computationally demanding since the MCMC 
can only visit a minute fraction of the 2^ model space even for a relatively large number of Gibbs 
iterations. To circumvent these issues, Barbieri & Berger p004 [ proposed the median probability 
model (MPM) defined as the model consisting of those variables which have an overall posterior 
probability of inclusion greater than or equal to 1/2. Although this is the optimal predictive model. 


Ghosh & Ghattas [20141 found that summaries of the posterior distribution based on marginal and 


joint distributions may give conflicting results for assessing the importance of strongly correlated 
covariates. 

Computational issues and considerations that many of the fijS may be small but not exactly zero 


have led to a rich variety of continuous shrinkage priors being proposed recently [Bhattacharya 


et al.[[20T4| [Carvalho et al.[ [2009[ [20T0| [Griffin & Brown[[2010[[Park & Casella[ [2008[ [Tippingl 


20011, which can be unified through a global-local (GL) scale mixture representation of Poison & 


Scott [20l0[ below. 


~ N(0, Aj-r), r~/, Xj 


9 , 


( 2 ) 


where / and g are densities on the positive real line. In Q, r controls global shrinkage towards 
the origin while the local parameters AjS allow deviations in the degree of shrinkage. Special cases 
include Bayesian lasso [Park & Casella 20081, relevance vector machine [Tipping[ 20011, normal- 
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gamma mixtures [Griffin & Brown 20101 and the horseshoe [Carvalho et al. 2009 2010 j among 
others. GL priors potentially have substantial eomputational advantages over mixture priors, sinee 
the normal seale mixture representation allows for eonjugate updating of (3 and A in bloeks. More¬ 
over, a number of frequentist regularization procedures such as ridge, lasso, bridge and elastic net 
correspond to posterior modes under GL priors with appropriate choices of / and g. 

The literature on model selection with continuous shrinkage priors is even less-developed due 
to the unavailability of exact zeros in the posterior samples of (3. Heuristic methods based on 
thresholding the posterior mean/median of f3 are often used in practice which lack theoretical 
justification, and inference is highly sensitive to the choice of the threshold. There is a recent 


literature on decoupling shrinkage and selection [Bondell & Reich 2012 Hahn & Carvalho 2015 


Vehtari & Lampinen] 2002 j, which poses the problem of selection as a loss function based decision 


rather than inducing sparsity through a prior distribution. Another naive way to select variables 
using a shrinkage prior is to see whether the posterior credible interval contains zero or not. Such 
a method usually has a poor performance because it is very difficult to estimate the uncertainty 
accurately in high dimensional problems. 

In this article, we aim to address the problem of selecting variables through a novel method of 
post processing the posterior samples. The approach is based on first obtaining a posterior distribu¬ 
tion of the number of signals by clustering the signal and the noise coefficients and then estimating 
the signals from the posterior median. This simple approach requires very few tuning parameters 
and is shown to have excellent performance relative to both existing frequentist and Bayesian ap¬ 
proaches. Moreover, the method is not only applicable to continuous shrinkage priors, but also can 
be used along with any shrinkage prior for (3 after a full MCMC run. For the ease of exposition, 
we focus on the spike-and-slab prior and the horseshoe prior and compare the performances using 
HPPM, MPM and the credible set approach for variable selection. Interestingly, in the presence of 
high collinearity among the covariates, we demonstrated better performance when the horseshoe 
prior is used in conjunction with our selection procedure. 

The organization of the present paper is as follows. Section [^describes the variable selection 
algorithm. Theoretical properties for collinear design matrices are considered in Sectionj^ Section 
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[^contains detailed comparisons in synthetic data. A discussion is provided in Section 

2. METHODOLOGY 

Our objective is to develop an algorithm to select the important variables based on the posterior 
samples of obtained from the Markov Chain Monte Carlo (MCMC) samples in ([T]) when a 
shrinkage prior is placed on {3. The algorithm is independent of the prior for f3, but dependent on 
the linear model with additive error assumption in ([T]). Unlike existing approaches, the method 
involves very few tuning parameters, hence readily suitable for future use of practitioners. Our 
idea is based on finding the most probable set of variables in the posterior median of jS. Since 
the distribution of the number of important variables is more stable and is largely unaffected by 
the mixing of the MCMC, we propose to first find the mode H of the distribution of number of 
important variables and then select the H largest coefficients from the posterior median of \f3j\. 

2 .1 2-means (2-M) variable selection 

We expect two clusters of |/3jj, with one concentrated closely near zero corresponding to noise 
variables and the other one away from zero corresponding to the signals. As an automated ap¬ 
proach, we cluster \f3j\s at each MCMC iteration using k-means with k = 2 clusters. For the Ah 
iteration, the number of non-zero signals hi is then estimated by the smaller cluster size out of 
the two clusters. A final estimate (H) of the number of non-zero signals is obtained by taking the 
mode over all the MCMC iterations, i.e., H = mode{hj}. The H largest entries of the posterior 
median of |/3| are identified as the non-zero signals. 

When the true coefficient vector has signal coefficients varying in the signal strengths, 2- 
M variable selection approach described above may inappropriately cluster the smaller signals 
together with the noise variables. A possible solution is to use different values of the number of 
clusters k, but it is usually difficult and time-consuming to find the optimal value of k. To deal 
with this, we propose a simple modification called the sequential 2-means below. 
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2.2 Sequential 2-means (S 2 M) variable seleetion 

We start with a few notations. Define St to be the indices of non-zero signals in (3t and to be 
the indices of the selected covariates. To assess the efficacy of a variable selection procedure, we 
introduce two types of errors a) |S't fl masking error (also called ‘false negatives’), and b) 
\Se n S'!,!: swamping error (also called ‘false positives’). 

When I3t has different levels of signal strengths, the 2-M variable selection approach will have 
a high chance of incurring masking error. In other words, it is highly likely that some true signals 
with low signal strengths will be clustered with the noise coefficients even when the corresponding 
l3jS are estimated well. Our main motivation to propose the sequential 2-means (S 2 M) variable 
selection approach is to reduce the probability of masking error. Let 6 > 0 be a tuning parameter, 
then S 2 M is defined as below. At the Ah iteration of MCMC: 

I perform a 2-means on \(3j\^j = 1,2,... ,p. Denote the two cluster means by m and M 
(m < M). Initialize a set A with an empty set. While the difference M — mis greater than b: 

(a) update A to be all the indices from the cluster with the lower mean m; 

(b) perform a 2-means on \/3j \ ,j E A; 

(c) update m and M to be two cluster means (m < M) obtained from (b). 

II The set A is considered to contain coefficients of noise covariates. So the estimated number 
of signals hi is p — \A\. 

The above algorithm is repeated for all MCMC samples of 13 and the final estimates of the number 
of signals and the coefficients corresponding to signals and the noise variables are obtained in the 
same way as in the 2-M algorithm. 

Using an appropriate tuning parameter h, S 2 M is capable of reducing the chance of masking 
error. However, note that this occurs at the cost of an increased probability of swamping error. A 
larger value of b tends to increase the masking error, while a smaller value of b increases swamping 
error. Hence one should choose b so that the sum of the two errors is minimized. In order to assess 
the important factors influencing the choice of b, let us first consider a noise-free version of the 
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model in ([T]) and hypothesise the ideal situation when the Bayesian proeedure provides exactly 
accurate estimation of /3r at every iteration. Then it is not hard to conclude any value of b between 
0 and the lowest absolute signal strength leads to correct variable selection. However, in presence 
of noise, it is impossible for any method to produce an exact estimate of (3t- In addition, using 
continuous shrinkage priors, the estimated coefficients for the noise variables will never be exactly 
zero. As the noise level a increases, the estimates for the coefficients become more variable. This 
makes it more likely for the non-zero coefficients of lower signal strength to be clustered with the 
noise coefficients leading to an increase in the masking error. Hence we believe a proper value 
for b should be based on the posterior estimate of a. However, the estimate of cr in ([T]) is affected 
by i) the true noise level ii) collinearity in the design matrix iii) as well as the ill-posed-ness of 
the high-dimensional regression problem i.e., how large p is compared to n. These key factors 
contribute to the accuracy of the selection procedure. 

The tuning parameter b should be chosen to be an increasing function of the estimated cr^ to 
take into account the increased variability in the estimates of the noise coefficients. The variable 
selection results using S 2 M approach with a large threshold will surely be no worse than that 
the previously stated 2-M approach. Through various simulation with different settings, we have 
observed using 2 times posterior median of cx^ (b = 2a‘^) results in accurate estimation of the 
number of signals H. Using very often results in masking error, while using always selects 
many noise coefficients to be signals. In practice, we suggest to use b = 2(j^ to reduce both 
masking and swamping errors. A higher value for b might be necessary if the number of selected 
active covariates obviously exceeds the expected number of signals. On the other hand, a smaller 
value of b would be desirable if more covariates are expected to be active. 


3. DEALING WITH CORRELATED PREDICTORS USING CONTINUOUS SHRINKAGE 

PRIORS 


In presence of confounders which are highly correlated with an important predictor, it is crucial 
that a variable selection method can identify the true predictor. [Bhattacharya et alT [2014 20121 
recently showed that a global-local shrinkage prior Q achieves better concentration around sparse 
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vectors in comparison with shrinkage prior based on only global scale i.e., setting ijjj = 1 in 
In this section, we show that such observations extend to the case of variable selection. While 
a “global-only” shrinkage prior fails to select the true variables under moderate correlation, an 
appropriately constructed global-local shrinkage prior can achieve desirable variable selection even 
under high correlation. 

For the ease of understanding the behavior of continuous shrinkage priors under correlation, 
consider only two covariates where variable 1 is the important predictor and variable 2 is the con- 
founder. Let X'X = [1 p; p 1] with p characterizing the correlation between the two predictors. 
Assume $mlej and /3sj are the maximum likelihood estimate and posterior mean of /3j,j = 1, 2 
respectively. We empirically observe that MLE underestimates the signal coefficient (3i and over¬ 
estimates the confounder (32 under high correlation. Ideally, a shrinkage prior should counter¬ 
balance this effect allowing the corresponding posterior estimates to be well-separated, thus facili¬ 
tating variable selection. We define a terminology called reverse-shrinkage to describe this. A prior 
is said to satisfy reverse-shrinkage if I > |/3 mle, 2| implies \l3s,i/i3sg\ > 0mle,i/^mleM- 
Suppose we can write a Bayes estimator under a shrinkage prior as a function of the MLE: 
(3s,j = (1 — Si{(3MLE))(3MLE,j, j = 1,2, ...,p, where the shrinkage Sj is less than 1. Then 
reverse-shrinkage means a larger magnitude of MLE results in a smaller shrinkage. Clearly, this is 
a desirable phenomenon for any variable selection approach. 

Theorem 3.1. Suppose (3i N{0, i = 1,2 in and the n X 2 covariate X satisfies 

X'X = [Ip; pi] where p G (0,1). Then if\(3MLE,i\ > \(3mle,2\, 


I^N,l 


fiMLE,l 

l^N,2 


fiMLE,2 


for any r and n> 2, where (3si denotes the posterior mean. 

Hence, when for correlated predictors, shrinkage priors with only global shrinkage parameters 
are no better than using MEEs. Next, we turn our attention to global-local shrinkage priors. We 
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focus on the horseshoe prior for a fixed value of the global shrinkage parameter r 


I ^ ~ ^3 ~ ^Ca+(0,1) 


(3) 


where Ca’''(0,1) denotes the standard half-Cauehy distribution with pdf 2/{7r(l + for x > 0. 
With X'X defined in Theorem |3.1i we write the HS estimators as functions of MLEs in Lemma 
(see Appendix). More preeisely, the HS estimators are funetions of p, r, $mle,2, and A = 


6.4 


(3mle,i/(3mle,2 


. Given values for these parameters, we calculate the approximate values of 


I3hs,i/$hs,2 and /3mle,i/$mle,2 in Matlab to see whether reverse-shrinkage oeeurs. 

Through the following figure, we will show the horseshoe prior ean be made to satisfy the 
reverse-shrinkage property by suitably ehoosing r. Figure provides numerieal analysis with 
Pmle,2 = 1 and 1.5, for different values of p G [0.94, 0.99], r G (0,1), and A > 1. Blue / red dots 
indieate reverse-shrinkage / laek of it. Figure shows reverse-shrinkage is more likely to occur 
when there are a smaller value of p, a greater value of A (these two observations are expeeted) and 
a smaller value of r present. The p and A are direetly dependent on data, however, by choosing r 
carefully ehosen, it is possible to inerease the possibility of aehieving reverse-shrinkage. Clearly, 
for values of p close to 1, the horseshoe prior with large values of r is less prone to aehieve the 
reverse-shrinkage eompared to smaller values of r. In praetiee, we suggest to have an upper bound 
for the global hyperparameter r when updating it in a sampler. 


4. SIMULATION STUDY 

Our prineiple goal in this seetion is to eompare the performanee of the methods we proposed, i.e. 
S 2 M and 2-M, with other competing methods in terms of variable selection, especially when there 
is high eollinearity present among the eovariates. We eonsider the horseshoe (HS), the spike-and- 


slab (SS) and the adaptive Lasso (AL) [Zou, 20061. For SS, we used 


/3j|7r,a;,a~7r(5o + (l-vr)Ar(o,aV;), a; ~ IG(3/2, 3/2), 1 - vr ~ Beta(l, 15), ~ IG(3/2, 3/2). 
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(a) PmLE,2 = 1 (b) i3MLE,2 = 1-5 

Figure 1: +ve reverse-shrinkage (blue), -ve reverse-shrinkage (red) for various A, p and r 


The HS is used as in Carvalho et al. [2010| 


I Xj, r, cT^ ~ N(0, Aj-ra^), Xj ~ y^Ca+(0,l), r ~ ^Ca+(0,1), ~ IG(3/2,3/2). 


(4) 


Carvalho et al. [2009,20101 suggested thresholding posterior estimates of kj = 1/(1-|-Aj) at 1/2 to 


perform variable selection in HS. We refer to this as the Hypothesis Testing approach, abbreviated 
as HT. Although a suitable kj can be defined in the GL family we found HT to be most effective 
with HS. It is important to reiterate here that S 2 M is independent of the prior for (3. We also used 
the credible set approach, abbreviated as CS. To implement AL, we used the par cor package in 
R . For HS, the variable selection approaches tried are: S 2 M, 2-M, CS and HT, while for SS we tried 
S 2 M, 2-M, HPPM and MPM. In the first two examples, results are provided by averaging over 25 
replicates of response obtained using the same covariate matrix. For the Bayesian procedures, the 
MCMC was run for 5,000 iterations discarding a burn-in of 2,000. In all the examples, the tuning 
parameter b is set to be 2a‘^. Convergence was monitored using standard tests and diagnostic trace 
plots. In the following tables, the first number in each parenthesis is the masking error and the 
second number is the swamping error. 


10 










































4.1 Simulation example 1 

Four settings for sample size (n), the number of eovariates (p), the number of signals (r) and signal 
strength (B) are (1) n = 50, p = 300, r = 10, B = 4; (2) n = 50, p = 300, r = 10, B = 6; (3) 
n = 100, p = 800, r = 20, -B = 4 and (4) n = 100, p = 800, r = 20, -B = 6. Under eaeh setting, 
we eonsidered an uneorrelated eovariates setting (uncor) and a eorrelated eovariates setting (cor). 
Observations of the eovariate are generated from standard normal distributions independently and 
an intereept is ineluded. The eovariate matrix eorresponding to cor eontains two pairs of eorrelated 
eovariates, and in each pair, one is a signal while the other is a noise predictor. Both correlations 
are above 0.99. 

In general, HS out-performs SS and AL significantly. We found the performance of our pro¬ 
posed method, S 2 M (or 2-M), is consistently better when used in conduction with HS, even when 
high collinearity is present between the eovariates. CS often selects the wrong one between a 
highly correlated pair of eovariates. When the difficulty of task is high, (see the 2nd and the 6th 
columns in Table. 1), CS masked a larger fraction of the true signals. The performance of HS-i-HT 
is excellent as well. However, we have to note here HT is a variable selection method for priors 
from GL family only, while our proposed methods can be broadly applied with various priors. 


Table 1: Masking and Swamping errors 


n, p, r 


n = 50, p = 

300, r = 10 



n = 100, p 

= 800, r = 20 


B 

4 

4 

6 

6 

4 

4 

6 

6 


uncor 

cor 

uncor 

cor 

uncor 

cor 

uncor 

cor 

HS+S 2 M 

(0, 0) 

(0.36,0.36) 

(0, 0) 

(0.28, 0.28) 

(0, 0) 

(0.16,0.16) 

(0, 0) 

(0.04, 0.04) 

HS+2M 

(0,0) 

(0.36,0.36) 

(0, 0) 

(0.28, 0.28) 

(0, 0) 

(0.16,0.16) 

(0, 0) 

(0.04, 0.04) 

HS+CS 

(0, 0) 

(1.8, 0) 

(0, 0) 

(1.42, 0) 

(0, 0) 

(2.52, 0.02) 

(0, 0) 

(0.2, 0) 

HS+HT 

(0, 0) 

(0.4, 0.24) 

(0,0) 

(0.24, 0.28) 

(0, 0) 

(0.2, 0.12) 

(0, 0) 

(0.04, 0.04) 

SS+S 2 M 

(0,0) 

(0.4, 0.4) 

(0.6, 0.76) 

(0.8, 1.08) 

(0.8, 6.32) 

(2.08, 5.56) 

(4.84, 13.56) 

(5.52, 16.32) 

SS+KM 

(0,0) 

(0.4, 0.4) 

(0.64, 0.48) 

(0.8, 1.04) 

(0.8, 6.32) 

(1.96, 5.56) 

(4.96, 13.44) 

(5.48, 16.12) 

SS+HPPM 

(0, 0.64) 

(0.4, 1.32) 

(0.48, 10.44) 

(0.6, 4.2) 

(4.96, 62.8) 

(4.64, 57.36) 

(7.96, 181.28) 

(7.6, 179.6) 

SS+MPM 

(0.04, 0.44) 

(0.4, 0.8) 

(0.52, 1.4) 

(0.64, 1.72) 

(4.16,4.84) 

(3.92, 4.96) 

(7.2, 15.12) 

(8, 15.8) 

AL 

(0.44, 0.72) 

(0.68, 2) 

(0.4, 0.68) 

(1, 0.92) 

(0, 0.2) 

(0.36,0.52) 

(0, 0) 

(0.16, 0.32) 


SS fails to estimate accurately when the difficulty of task is high (see the right half of Table. 1). 
Under this prior, the results using S 2 M (or 2-M) are slightly better than those using other methods, 
especially HPPM, which leads to large swamping errors. In addition, S 2 M-I- HS out-performs AL. 
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4.2 Simulation example 2 

In this example, we eompare S 2 M and 2-M, when the true eoeffieient veetor I3t eontains different 
levels of signal strengths. We set n = 50 with p = 300 and r = 10. There are three 15s and seven 
4s among the 10 signals. The uncor and cor are defined as before. 

The estimation in this simulation example is aeeurate, regardless of using HS, SS or AL. How¬ 
ever, the differenee between two levels of signal strength in (3t is large enough to cause S 2 M and 
2-M to have different performances. S 2 M leads to excellent variable selection, while 2-M always 
masks all the 7 signals of lower signal strength. Again, in this example, S 2 M out-performs the 
competing methods, CS and HT, when the covariates matrix contains correlated covariates. 

Table 2: Masking and Swamping errors 


n,p,r 

n = 50,p 

= 300, r = 10 


uncor 

cor 

HS- 1 -S 2 M 

(0,0) 

(0.22,0.22) 

HS-I-2M 

(7,0) 

(7,0) 

HS-rCS 

(0,0) 

(1.88,0) 

HS-i-HT 

(0,0) 

(0.26,0.22) 

SS+S 2 M 

(0,0) 

(0.64,0.64) 

SS-i-KM 

(7,0) 

(7,0) 

SS-i-HPPM 

(0,0.5) 

(0.6,1.24) 

SS-i-MPM 

(0,0.26) 

(0.64,0.94) 

AL 

(0,0) 

(0.74,0.76) 


In addition, with five noise covariates correlated to (all correlations above 0.98) each of the two 
signal covariates, the two errors (under setting 2) with using HS are (0.46,0.46), (0.46,0.46), (2,0), 
and (1.16,0.12). S 2 M out-performs both CS and HT when there are more correlated covariates. 


4.3 Simulation example 3 


We consider the lymphoma dataset [ [Rosenwald et al.[|2002[ , which consists of 240 observations 
and 7399 features representing 4128 genes from the Lymphochip cDNA microarray. We randomly 
selected 2000 features used as predictors, and randomly selected 30 of them to be signals. The 
response was simulated by the linear model Q, with standardized predictors, a coefficient vector 
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(3, and e ~ iV(0,1). The coefficients of signal predictors in /3 are 4 and the coefficients of noise 
predictors are 0. . Among 2000^ pairs of predictors, there are 510 pairs with correlation above 
0.8, 380 pairs above 0.90 and 323 pairs above 0.95. We used the horseshoe prior with the four 
variable selection methods in the earlier two examples. MCMC was run for 10,000 iterations after 
discarding a bum-in of 5,000. We obtained the following pairs of errors (1,1), (1,1), (4,0), (2,1), 
corresponding to S 2 M, 2-M, CS and HT respectively. 

5. DISCUSSION 

In this article, we developed a simple but useful method for doing variable selection using shrink¬ 
age priors by post-processing posterior samples of the regression coefficients. Our method is es¬ 
sentially applicable to any prior and is based on only one tuning parameter. We observe excellent 
performances of our method in terms of computational efficiency and dealing with correlated co¬ 
variates. The only tuning parameter associated with this method plays a key role to minimize the 
chance of masking while keeping the chance of swamping low as well. Although our current pro¬ 
posal for the tuning parameter works well in most situations we tried, we would like to explore a 
theoretically rigorous way to choose this tuning parameter in future. 

The theoretical results are restricted to priors with only global shrinkage parameters. Although 
we have provided several numerical analysis to better understand the shrinkage properties for the 
horseshoe prior when the covariates are highly correlated, we aim to study reverse-shrinkage more 
rigorously for the general class of global-local priors @ in future. 

6. APPENDIX 

Lemma 6.1. Define function h as 

h{x) = f N{x](3,a‘^{X'X)~^)Ti{(3)df3. (5) 

J/3 


13 


With normal priors on fi and X'X as in Theorem 1, the function h can be written as: 


h{xi,X2) = Ca 


-2 




1 _ (1 _ ^)2p2 \ + V^XlX2) 1 ^, 


where n = 1 / {1 + t'^), C is a constant independent from a, r, p, xi, X2, and 


/i(«;p) =/2 (k;p) = Y—TT— hW,p) = 

1 — (1 — 


PK 


1 — (1 — K)2p2 


(6) 


(7) 


Proof. Through the calculations below, C represents different constant numbers from step to step. 
However, C is always independent from a, r, p, xi and X2. 

h{xi,X 2) = ^ ~ ^ {X 2 - ^2^ + 2p{xi -/3 i){x2- P 2 )] 

First we try to integrate /3i out, obtaining 


h{xi,X 2) — C [ exp +xl +/32 + :^/32 - 2x2/32 + 2pxiX2 - 2pxi/32)^ hd/32, 

( 8 ) 

where h = ^/Wpaexp with Bl = A'{^{px2 + xi - p/32)^ Af = ^ and p = 

Substitute h and back to ([^, and continue to integrate ([^ with respect to (32, obtaining 


h{xuX2) = Ca V 2exp|-^(a;2 + a;2^2pxiX2)}exp|^(pa;2 + xi)2}/2, (9) 

where h = y/^aexp ^ith = A2^[pp(pa;2 + xf - {pxi + 0:2)]^ and 


Substituting I2 with back to and letting k = 1 — p = can be obtained. □ 

Lemma 6.2. Suppose |/3mle,i| = A\I3mlep\, I3mle = {^mle,!, ^mlep), then the normal estima- 
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tors can be written as functions of the MLEs as: 


(^N,l — 

/3n,2 = 


1 i_p 2 (-Ri(/3mle) 


PR20MLe) 

A 


/3mle,i 


1 i_p2 [Ri{i3MLE) — pR2{/3mle)A 


( 10 ) 


/3mLE,2, 


where Ri0mle) = -\{Afi{K) + fsin)) and R20 mle) = -(/ 2 (k) + Af 3 {K)). 


Proof Continuing the Lemma 6.1, the derivatives of the funetion h{xi^ X 2 ) with respeetive to Xi is 


d 

—h{xi, X 2 ) = Ca~‘^{fiXi + f3X3-i)h{xi, X 2 ) i = 1, 2. 

OXi 


where C here denotes the eonstant before the exponential in the funetion h. Define 


R*{Xi,X2) = 


Xi h 


a'^Xi 


ifiXi + f3X3-i) i = l,2. 


( 11 ) 


Considering \0mle,i\ = A\0mle,2\, both R10mle) and R20mle) can be written as funetions of 


A as in Lemma. Lfsing the result of the Proposition 1 in Griffin & Brown |2010|, where 


S(/3) = a^(X'X)-^[R*(/3MLE) 0; 0 R;(0 mle)] = (X'X)-^IR3(0mle), R2(/3mle)] 


in this case, ( [T^ can be obtained. 


□ 


Lemma 6.3. Define 


83 = 


(fi. - ‘t) 

1 — 


^2 


(^i?2 ~ pRl^ 
1 — 


( 12 ) 


then —1 < /i(k; p) = f 2 {,R] p) < fsi^', p) < 0, 0 < Si < 1, S 2 < 1, for any < p < 1, A > 1, 
and r > 0. 


Proof fi = /a and /s < 0 can be directly obtained from the definitions Q. 


/i(k;p) > -1 ^ 


(p^ — 1 — p'^k)k 

1 — (1 — K)2p2 


> —1 -V^ p^K — K — p‘^hf > (k^ + 1 — 2n)p‘^ — 1. 


(13) 
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For the last inequality in ( |T^ , LHS — RHS = 2p^K(l — k) + (1 — p^)(l — k) > 0. Thus fi > —1. 


/s > /l (p2 - 1 - p‘^k)k < -pK^. 


(14) 


For the inequality in ( jll] ), LHS — RHS = (1 — p){pK^ — k — p) < 0, this leads to /s > /i. 
Since /i, /2 and /s are always negative, Ri and R 2 are always positive. 


^ ^ pR2 Ri P 

5i > 0 i?i > —— . 

A R 2 A 


(15) 


We proved in the following proof of Theorem 3.1 Considering the fact 


R 2 ^ A+pA2 


A+p ^ p_ 
A+pA'2 ^ A’ 


the last inequality of ( [T5| ) is true. Thus Si > 0. 

We prefer to use p = 1 — k = r^/[l + r^] to prove Si < 1: 


Si < 1 Ri - < 1 - p^ ARi - pi?2 < A - Ap^. 


(16) 


For the last inequality in (161, we can finally obtain LHS — RHS = (1 — p^){pp‘^ — PL + Ap^p^ ■ 
Ap). Since both pp{p — 1) and Ap{pp — 1) are negative, LHS — RHS is less than 0. 


^2 < 1 ^ ^2 - pRiA < 1 - p2. (17) 

For the inequality in ( jlv] ), LHS — RHS = (1 — p‘^)[{App‘^ — App) + (p^/i^ — p)]. Since both 
{App^ — App) and (p^p^ — p) are negative, thus LHS is indeed less than RHS. □ 

Now we are ready to give the proof of Theorem [3T] 

Proof. Case 1: the Theorem is true when 52 < 0. That is because Si is always between 0 and 
1, causing \Pn,i\ will always be less than |/3 mle,i|- But \Pn, 2 \ will be greater than I^mleM since 
(1 — ^ 2 ) must be greater than 1 in this case. 
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Case 2: when both Si and S 2 are between 0 and 1, we have 


(3) 


^ 

- > 1 - — - 

^2 R 2 - pRiA 


> 1 ARi - pR2 > AR 2 - pRiA^ 


(18) 


For the last inequality in ( [T8] ), substituting Ri and R 2 derived from Lemma 6.2 we ean obtain 
LHS - RHS = p{l - p^){A^ -1){1 -k)> 0. 

Note this verifies ^ whieh is needed for proving S'! > 0 earlier. □ 

Lemma 6.4. The function h is defined as 0. under the model 0 with the horseshoe prior on fi, 
then 

h{xi,X 2 )=C F{ki, K 2 ; p)E{ki, K 2 ;p,Xi,X 2 )dKidK 2 , (19) 

J Ki,K2 

where Hi = 1/[1 + i = 1,2, C is a constant independent from (xi, X 2 , p, Xi, X 2 ), and 

F{ki,K2]p) = [l-(l-Ki)(l-K2)p^]“^[l-(l-r2)Ki]“^[l-(l-r2)K2]“Hl-«^i)“^(l-«^2)“5 
E{ki, K 2 , p, Xi, X2) = exp I ^(/ix? + f2xl + 
with 

fi{Hl,H2]p) = [(p^ - 1 - p^Hii_i)Hi][l - (1 - Kl)(l - H2)p^]~^ i = l,2 
f 3 ( 1 ^ 1 , 1 ^ 2 ; p) = -pKiK2[l - (1 - Kl)(l - K2)p‘^]~^. 

It follows that the horseshoe estimator can be represented as the right hand side where 

Ri is 

, X I j [fi{Ki,K2)Xi + f‘i{Ki,K2)Xii-i]F{Ki,K2)E{Ki,K2)dKidK2 

Ri[xi,X 2 ) = -- -^- r -— X,,, - 1 = 1,2. 


Xi 


E(ki, K2)E(ki, K2)dKidK2 


( 20 ) 


Proof With the horseshoe prior being applied, the 7r(/)) in (1^ beeomes 


^(/^) = / 7r(/9|A)7r(A)(iA = / iV(/)i; 0, (T^r^A|)iV(/32; 0, cr^r^A2)vr(Ai)7r(A2)(iAi(iA2 

J \ J Ai ,A2 


The funetion h now is an integral with respeet to (/9i,/32, Ai, A 2 ). The triek here is to integrate 
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(3i^(32 out, then h will be an integral with respeet to Ai, A 2 only: 


h{x) = ^ 7r(A) |/^ N{x- /3, a^{X>X)-^)7r{/3\X)d(3} dX. (21) 

Note 7r(/3| A) is a two-dimension normal distribution, with ki and K 2 defined in this Lemma, we 
can easily integrate /3i, /32 out through the similar approach as we did in the proof of Lemma [6T[ 
and the integral with respect to fd inside the braces above is 

[ = C — , \/ 1 2 E(ki, k, 2; p, Xi, X 2 ), 

where C is a constant independent from {xi,X 2 -, p, Ai, A 2 ). 

The prior on Aj is 7r(Aj) = 2/[7r(l + A^)] then the prior on Ki is followed as: 


TT 1 — (1 — T^)Ki 


Substituting L and ni^Ki) back to (21), (19) is obtained. Considering 


d 

—h{xi,X2) = Ca 


-2 


{fiXi + fzX2,-i)F{Ki, K2 )E{ki, K2)dKidK2 


' A2 


and defining Ri = a‘^R* — {-^h)/{xih ), use the result of Proposition 1 in 
again, (flO]) is obtained. 


Griffin & Brown 


[20101 


□ 
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